Bayesian subset selection and variable importance for interpretable prediction and classification

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model ℳ, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single “best” subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via ℳ. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.


Introduction
Subset and variable selection are essential components of regression analysis, prediction, and classification.By identifying subsets of important covariates, the analyst can acquire simpler and more interpretable summaries of the data, improved prediction or classification, reduced estimation variability among the selected covariates, lower storage requirements, and insights into the factors that determine predictive accuracy (Miller, 1984).Classical subset selection is expressed as the solution to the constrained least squares problem min where y is an n-dimensional response, X is an n × p matrix of covariates, and β is the p-dimensional vector of unknown coefficients.Traditionally, the goal is to determine (i) the best subset of each size k, (ii) an estimate of the accompanying nonzero linear coefficients, and (iii) the best subset of any size.Subset selection has garnered additional attention recently, in part due to the algorithmic advancements from Bertsimas et al. (2016) and the detailed comparisons of Hastie et al. (2020).More broadly, variable selection has been deployed as a tool for interpretable machine learning, including for highly complex and nonlinear models (e.g., Ribeiro et al., 2016;Afrabandpey et al., 2020).
Although often considered the "gold standard" of variable selection, subset selection remains underutilized due to several critical limitations.First, subset selection is inherently unstable: it is common to obtain entirely distinct subsets under perturbations or resampling of the data.This instability undermines the interpretability of a single "best" subset, and is a significant motivating factor for the proposed methods.Second, the solutions to (1) are unregularized.While it is advantageous to avoid overshrinkage, Hastie et al. (2020) showed that the lack of any regularization in (1) leads to deteriorating performance relative to penalized regression in low-signal settings.Third, inference about β requires careful adjustment for selection bias, which limits the available options for uncertainty quantification.Finally, solving (1) is computationally demanding even for moderate p, which has spawned many algorithmic advancements spanning multiple decades (e.g., Furnival and Wilson, 2000;Gatu and Kontoghiorghes, 2006;Bertsimas et al., 2016).For these reasons, penalized regression techniques that replace the 0 -penalty with convex or nonconvex yet computationally feasible alternatives (Fan and Lv, 2010) are often preferred.
From a Bayesian perspective, (1) is usually translated into a predictive model via a Gaussian (log-) likelihood and a sparsity (log-) prior.Indeed, substantial research efforts have been devoted to both sparsity (e.g., Ishwaran and Rao, 2005) and shrinkage (e.g., Polson and Scott, 2010) priors for β.Yet the prior alone cannot select subsets: the prior is the component of the data-generating process that incorporates prior beliefs, information, or regularization, while selection is ultimately a decision problem (Lindley, 1968).For instance, Barbieri and Berger (2004) and Liang et al. (2013) applied sparsity priors to obtain posterior inclusion probabilities, which were then used for marginal selection and screening, respectively.Jin and Goh (2020) selected subsets using marginal likelihoods, but required conjugate priors for Gaussian linear models.Hahn and Carvalho (2015) more fully embraced the decision analysis approach for variable selection, and augmented a squared error loss with an 1 -penalty for linear variable selection.Alternative loss functions have been proposed for seemingly unrelated regressions (Puelz et al., 2017), graphical models (Bashir et al., 2019), nonlinear regressions (Woody et al., 2020), functional regression (Kowal and Bourgeois, 2020), time-varying parameter models (Huber et al., 2020), and a variety of Kullback-Leibler approximations to the likelihood (Goutis and Robert, 1998;Nott and Leng, 2010;Tran et al., 2012;Piironen et al., 2020).
Despite the appropriate deployment of decision analysis for selection, each of these Bayesian methods relies on 1 -penalization or forward search.As such, they are restricted to limited search paths that cannot fully solve the (exhaustive) subset selection problem.More critically, these Bayesian approaches-as well as most classical ones-are unified in their emphasis on selecting a single "best" subset.However, in practice it is common for many subsets (or models) to achieve near-optimal predictive performance, known as the Rashomon effect (Breiman, 2001).This effect is particularly pronounced for correlated covariates, weak signals, or small sample sizes.Under these conditions, it is empirically and theoretically possible for the "true" covariates to be predictively inferior to a proper subset (Wu et al., 2007).As a result, the "best" subset is not only less valuable but also less interpretable.Reporting a single subset-or a small number of subsets along a highly restricted search path-obscures the likely presence of many distinct yet equally-predictive subsets of variables.
We advance an agenda that instead curates and summarizes a family of optimal or near-optimal subsets.This broader analysis alleviates the instability issues of a single "best" subset and provides a more complete predictive picture.The proposed approach operates within a decision analysis framework and is compatible with any Bayesian model M for prediction or classification.Naturally, M should represent the modeler's beliefs about the data-generating process and describe the salient features in the data.Several key developments are required: The importance of curating and exploring a collection of subsets has been acknowledged previously.Existing approaches are predominantly frequentist, including fence methods (Jiang et al., 2008), Rashomon sets (Semenova and Rudin, 2019), bootstrapped confidence sets (Lei, 2019), and subsampling-based forward selection (Kissel and Mentch, 2021).Although the acceptable family has appeared previously for Bayesian decision analysis (Kowal, 2021;Kowal et al., 2020), it was applied only along the 1 -path which does not enumerate a sufficiently rich collection of competitive subsets.Further, previous applications of the acceptable family did not address points 1., 2., and 4. above.
The paper is outlined as follows.Section 2 contains the Bayesian subset search procedures, the construction of acceptable families, the (co-) variable importance metrics, and the predictive uncertainty quantification.Section 3 details the simulation study.The methodology is applied to a large education dataset in Section 4. We conclude in Section 5.The Appendix provides additional algorithmic details, simulation studies, and results from the application.An R package is available online at https://github.com/drkowal/BayesSubsets.Although the education dataset (Children's Environmental Health Initiative, 2020) cannot be released due to privacy protections, access to the dataset can occur through establishing affiliation with the Children's Environmental Health Initiative (contact cehi@nd.edu).

Predictive decision analysis
Decision analysis establishes the framework for extracting actions, estimators, and predictions from a Bayesian model (e.g., Bernardo and Smith, 2009).These tools translate probabilistic models into practical decision-making and can be deployed to summarize or interpret complex models.However, additional methodology is needed to convert subset selection into a decision problem, and further to evaluate and collect many near-optimal subsets.
Let M denote any Bayesian model with a proper posterior distribution p M (θ|y) and posterior predictive distribution p M (ỹ|y) := p M (ỹ|θ, y)p M (θ|y)dθ, where θ denotes the parameters of the model M and y is the observed data.Informally, p M (ỹ|y) defines the distribution of future or unobserved data ỹ conditional on the observed data y and according to the model M. Decision analysis evaluates each action δ based on a loss function L(ỹ, δ) that enumerates the cost of each action when ỹ is realized.Examples include point prediction (e.g., squared error loss) or classification (e.g., cross-entropy loss), interval estimation (e.g., minimum length subject to 1 − α coverage), and selection among a set of hypotheses (e.g., 0-1 loss).Since ỹ is unknown yet modeled probabilistically under M, an optimal action minimizes the posterior predictive expected loss with the expectation taken under the Bayesian model M. The operation in (2) averages the predictive loss over the possible realizations of ỹ according to the posterior probability under M and then minimizes the resulting quantity over the action space.Yet without careful specification of the loss function L, (2) does not provide a clear pathway for subset selection.To see this, let ỹ(x) denote the predictive variable at covariate x.For point prediction of ỹ(x) under squared error loss, the optimal action is i.e., the posterior predictive expectation at x. Similarly, for classification of ỹ(x) ∈ {0, 1} under cross-entropy loss (see ( 10)), the optimal action is the posterior predictive probability δ(x) = p M {ỹ(x) = 1|y}.For a generic model M, there is not necessarily a closed form for δ(x): these actions are computed separately for each x with no clear mechanism for inducing sparsity or specifying distinct subsets.Hence, additional techniques are needed to supply actions that are not only optimal but also selective and interpretable.
Note that we use observation-driven rather than parameter-driven loss functions.Unlike the parameters θ, which are unobservable and model-specific, the predictive variables ỹ are observable and directly comparable across distinct Bayesian models.A decision analysis based on ỹ operates on the same scale and in the same units as the data that have been-and will be-observed, which improves interpretability.Perhaps most important, an observation-driven decision analysis enables empirical evaluation of the selected actions.

Subset search for linear prediction
The subset search procedure is built within the decision framework of (2).For any Bayesian model M, we consider linear actions δ(x) = x δ with δ ∈ R p , which offer both interpretability and the capacity for selection.Let δ S denote the linear action with zero coefficients for all j ∈ S, where S ⊆ {1, . . ., p} is a subset of active variables.For prediction, we assemble the aggregate and weighted squared error loss (3) where ỹi := ỹ(x i ) is the predictive variable at xi for each i = 1, . . ., ñ and ω(x i ) > 0 is a weighting function.The covariate values {x i } ñ i=1 can be distinct from the observed covariate values {x i } n i=1 , for example to evaluate the action for a different population.The loss in (3) evaluates linear coefficients δ S for any given subset S by accumulating the squared error loss over the covariate values {x i } ñ i=1 .Since (3) depends on {ỹ i } ñ i=1 , the loss inherits a joint predictive distribution p M (ỹ 1 , . . ., ỹñ |y) from M. The loss is decoupled from the Bayesian model M: the linear action does not require a linearity assumption for M. The weights ω(x i ) can be used to target actions locally, which provides a sparse and local linear approximation to M. For example, we might parametrize the weighting function as ω(x i ) ∝ exp(− xi − x * 2 2 / ) with range parameter in order to weight based on proximity to some particular x * of interest (Ribeiro et al., 2016).Alternatively, ω can be specified via a probability model for the likelihood of observing each covariate value, including ω(x i ) = ñ−1 as a simple yet useful example, especially when using the observed covariate values {x i } ñ i=1 = {x i } n i=1 ; this is our default choice.Crucially, the optimal action can be solved directly for any subset S: The optimal action (2) for the loss where ŷi := E ỹ|y {ỹ(x i )}, ŷ := (ŷ 1 , . . ., ŷñ ) , Ω := diag{ω(x i )} ñ i=1 , and XS is the ñ × |S| matrix of the active covariates in {x i } ñ i=1 for subset S.
, where the first term is a (finite) constant that does not depend on δ S .The remaining steps constitute a weighted least squares solution.
The consequence of Lemma 1 is that the optimal action for each subset S is simply the (weighted) least squares solution based on pseudo-data ( XS , ŷ)-i.e., a "fit to the fit" from M. The advantages of M can be substantial: the Bayesian model propagates regularization (e.g., shrinkage, sparsity, or smoothness) to the point predictions ŷ, which typically offers sizable improvements in estimation and prediction relative ordinary (weighted) least squares.This effect is especially pronounced in the presence of high-dimensional (large p) or correlated covariates.The optimal action may be non-unique if X S Ω XS is noninvertible, in which case the inverse in (5) can be replaced by a generalized inverse.
At this stage, the Bayesian model M is only needed to supply the pseudo-response variable ŷi ; different choices of M will result in distinct values of ŷi and therefore distinct actions δS .An illuminating special case occurs for linear regression: Corollary 2 For the linear regression model M with E ỹ|θ {ỹ(x)} = x β and any set of covariate values {x i } ñ i=1 and weights ω(x i ) > 0, the optimal action (2) under (3) for the full set of covariates is δ{1,...,p} = β, where β := E θ|y β.
Depending on the choice of {x i } ñ i=1 and ω, δ{1,...,p} may be non-unique.Corollary 2 links the optimal action to the model parameters: the posterior expectation β is also the optimal action under the parameter-driven squared error loss Similarly, a linear model for M implies that ŷi = x i β, so the optimal action (5) for any subset S is intrinsically connected to the (regression) model parameters.This persists for other regression models as well.By contrast, these restrictions also illustrate the generality of (3)-( 5): the optimal linear actions are derived explicitly under any model M (with E ỹ|y ỹ(x i ) 2 2 < ∞) and using any set of covariate values {x i } ñ i=1 , active covariates S ⊆ {1, . . ., p}, and weighting functions ω(x) > 0.
The critical remaining challenge is optimization-or at least evaluation and comparisonamong the possible subsets S. Our strategy emerges from the observation that there may be many subsets that achieve near-optimal predictive performance, often referred to as the Rashomon effect (Breiman, 2001).The goal is to collect, characterize, and compare these near-optimal subsets of linear predictors.Hence, there are two core tasks: (i) identify candidate subsets and (ii) filter to include only those subsets that achieve near-optimal predictive performance.These tasks must overcome both computational and methodological challenges-similar to classical (non-Bayesian) subset selection-which we resolve in the subsequent sections.
An exhaustive enumeration of all possible subsets presents an enormous computational burden, even for moderate p.Although tempting, it is misguided to consider direct optimization over all possible subsets of {1, . . ., p}, for the aggregate squared error loss (3).To see this-and find suitable alternatives-consider the following result: Lemma 3 Let RSS(y, µ) := y − µ 2 2 and ŷ := E ỹ|y ỹ, and suppose E ỹ|y ỹ 2 2 < ∞.For any point predictors µ 1 and µ 2 , we have 2 is finite and does not depend on µ, the ordering of µ 1 and µ 2 will be identical whether using E ỹ|y {RSS(ỹ, µ)} or RSS(ŷ, µ).
Now recall the optimization of (6).For any subset S, the optimal action is the least squares solution (5) with pseudo-data ŷ.However, RSS in linear regression is ordered by nested subsets: RSS(ŷ, X δS 1 ) ≤ RSS(ŷ, X δS 2 ) whenever S 2 ⊆ S 1 .By Lemma 3, it follows that the solution of ( 6) is simply δ S = ( X Ω X) −1 X Ωŷ, S = {1, . . ., p} for X = (x 1 , . . ., xñ ) .As with (5), a generalized inverse can be substituted if necessary.The main consequence is that the optimal actions in (4) and ( 6) alone cannot select variables or subsets: (4) provides the optimal action for a given subset S, while (6) trivially returns the full set of covariates.Despite the posterior predictive expectation in (6), this optimality is only valid in-sample and is unlikely to persist for out-of-sample prediction.Hence, this optimality is unsatisfying.
Yet Lemma 3 provides a path forward.Rather than fixing a subset S in (4) or optimizing over all subsets in (6), suppose we compare among subsets of a fixed size k < p. Equivalently, this constraint can be representation as an 0 -penalty augmentation to the loss function (3), i.e., δk := arg min where the inequality follows from Lemma 3. In direct contrast with previous approaches for Bayesian variable selection via decision analysis (e.g., Hahn and Carvalho, 2015;Woody et al., 2020;Kowal et al., 2020) we do not use convex relaxations to 1 -penalties, which create unnecessarily restrictive search paths and introduce additional bias in the coefficient estimates.
Under the loss (3), the solution to (8) reduces to δk = arg min using the same argument as Lemma 1.In particular, the solution δk resembles classical subset selection (1), but uses the fitted values ŷi from M instead of the data y and further generalizes to include possibly distinct covariates {x i } and weights {ω(x i )}.
Because of the representation in (9), we can effectively solve (8) by leveraging existing algorithms for subset selection.However, our broader interest in the collection of nearoptimal subsets places greater emphasis on the search and filtering process.For any two subsets S 1 and S 2 of equal size ).Therefore, we can order the linear actions from (4) among all equally-sized subsets simply by ordering the values of RSS(ŷ, Xδ S ).This result resembles the analogous scenario in classical linear regression on {(x i , y i )} n i=1 : subsets of fixed size maintain the same ordering whether using RSS or information criteria such as AIC, BIC, or Mallow's C p .Here, the criterion of interest is the posterior predictive expected RSS, E ỹ|y {RSS(ỹ, Xδ S )}, and the RSS reduction occurs with the model M fitted values ŷi = E ỹ|y ỹ(x i ) serving as pseudo-data.
Because of this RSS-based ordering among equally-sized linear actions, we can leverage the computational advantages of the branch-and-bound algorithm (BBA) for efficient subset exploration (Furnival and Wilson, 2000).Using a tree-based enumeration of all possible subsets, BBA avoids an exhaustive subset search by carefully eliminating non-competitive subsets (or branches) according to RSS.BBA is particularly advantageous for (i) selecting s max ≤ p covariates, (ii) filtering to m k ≤ p k subsets for each size k, and (iii) exploiting the presence of covariates that are almost always present for low-RSS models (Miller, 1984).In the present setting, the key inputs to the algorithm are the covariates {x i }, the pseudo-data {ŷ i }, and the weights {ω(x i )}.In addition, we specify the maximum number of subsets m k ≤ p k to return for each size k.As m k increases, BBA returns more subsets-with higher RSS-yet computational efficiency deteriorates.We consider default values of m k = 100 and m k = 15 and compare the results in Section 4.An efficient implementation of BBA is available in the leaps package in R. Note that our framework is also compatible with many other subset search algorithms (e.g., Bertsimas et al., 2016).
In the case of moderate to large p, we screen to s max ≤ p covariates using the original model M. Specifically, we select the s max covariates with the largest absolute (standardized) linear regression coefficients.When M is a nonlinear model, we use the optimal linear coefficients δS on the full subset S = {1, . . ., p} of (standardized) covariates.The use of marginal screening is common in both frequentist (Fan and Lv, 2008) and Bayesian (Bondell and Reich, 2012) high-dimensional linear regression models, with accompanying consistency results in each case.Here, sceening is motivated by computational scalability and interpretability: BBA is quite efficient for p ≤ 35 and m k ≤ 100, while the interpretation of a subset of covariates-acting jointly to predict accurately-is muddled as the subset size increases.By default, we fix s max = 35.We emphasize that although this screening procedure relies on marginal criteria, it is based on a joint model M that incorporates all p covariates.In that sense, our screening procedure resembles the most popular Bayesian variable selection strategies based on posterior inclusion probabilities (Barbieri and Berger, 2004) or hard-thresholding (Datta and Ghosh, 2013).

Subset search for logistic classifiers
Classification and binary regression operate on {0, 1}, rendering the squared error loss (3) unsuitable.Consider a binary predictive functional h(ỹ) ∈ {0, 1} where ỹ ∼ p M (ỹ|y).In this framework, binarization can come from one of two sources.Most common, the data are binary y i ∈ {0, 1}, paired with the identity functional h(ỹ) = ỹ, and M is a Bayesian classification (e.g., probit or logistic regression) model.Less common, non-binary data can be modeled via M and paired with a functional h that maps to {0, 1}.For example, we may be interested in selecting variables to predict exceedance of a threshold, h(ỹ) = I{ỹ ≥ τ }, for some τ based on real-valued data y.The latter case is an example of targeted prediction (Kowal, 2021), which customizes posterior summaries or decisions for any functional h.This approach is distinct from fitting separate models to each empirical functional {h(y i )} n i=1which is still compatible with the first setting above-and instead requires only a single Bayesian reference model M for all target functionals h.
For classification or binary prediction of h(ỹ) ∈ {0, 1}, we replace the squared error loss (3) with the aggregate and weighted cross-entropy loss, where h(ỹ i ) ∈ {0, 1} is the predictive variable at xi under M and π S ( The cross-entropy is also the deviance or negative log-likelihood of a series of independent Bernoulli random variables h(ỹ i ) each with success probability π S (x i ) for i = 1, . . ., ñ.However, (10) does not imply a distributional assumption for the decision analysis; all distributional assumptions are encapsulated within M, including the posterior predictive distribution of h(ỹ i ).As before, δ S is the linear action with zero coefficients for all j ∈ S, where S ⊆ {1, . . ., p} is a subset of active variables.The optimal action (2) is obtained for each subset S by computing expectations with respect to the posterior predictive distribution under M and minimizing the ensuing quantity.As in the case of squared error loss, key simplifications are available: where ĥi := E ỹ|y h(ỹ i ) is the posterior predictive expectation of h(ỹ i ) under M. The representation in ( 11) is quite useful: it is the negative log-likelihood for a logistic regression model with pseudo-data {(x i , ĥi )} ñ i=1 .Standard algorithms and software, such as iteratively reweighted least squares (IRLS) in the glm package in R, can be applied to solve (11) for any subset S.
Instead of fitting a logistic regression to the observed binary variables h(y i ) ∈ {0, 1}, the optimal action under cross-entropy (10) fits to the posterior predictive probabilities ĥi = p M {h(ỹ i ) = 1|y} ∈ [0, 1] under M.For a well-specified model M, these posterior probabilities ĥi can be more informative than the binary empirical functionals h(y i ): the former lie on a continuum between the endpoints zero and one.Furthermore, for non-degenerate models M with ĥi ∈ (0, 1), the optimal action (11) resolves the issue of separability, which is a persistent challenge in classical logistic regression.
Despite the efficiency of IRLS for a fixed subset S, the computational savings of BBA for subset search rely on the RSS from a linear model.As such, solving (11) for all or many subsets incurs a much greater computational cost.Yet IRLS is intrinsically linked with RSS.At convergence, IRLS obtains a weighted least squares solution where and pseudo-data with z := (z 1 , . . ., z ñ) .By design, the weighted least squares objective associated with ( 12) is a second-order Taylor approximation to the predictive expected cross-entropy loss: The weighted least squares approximation in ( 14) summons BBA for subset search.Hosmer et al. (1989) adopted this strategy for subset selection in classical logistic regression on y i ∈ {0, 1}.Here, both the goals and the optimization criterion are distinct: we are interested in curating a collection of near-optimal subsets-rather than selecting a single "best" subset-and the weighted least squares objective ( 14) inherits the fitted probabilities ĥi from the Bayesian model M along with the weights ω(x i ).
Ideally, we might apply BBA directly based on the covariates {x i }, the pseudo-data {z i }, and the weights {w i }.However, both z i and w i depend on πS (x i ) and therefore are subset-specific.As a result, BBA cannot be applied without significant modifications.A suitable alternative is to construct subset-invariant psuedo-data and weights by replacing πS (x i ) with the corresponding estimate from the full model, ĥi .Specifically, let both of which depend on M rather than the individual subsets S. The pseudo-data ẑi is defined similarly to z i in ( 13), where the second term now cancels.Finally, BBA subset search can be applied using the covariates {x i }, the pseudo-data {ẑ i }, and the weights { ŵi }.
As for squared error, we restrict each subset size to m k = 100 or m k = 15 for all subsets of size k = 1, . . ., p.Despite the critical role of the weighted least squares approximation in ( 14) for subset search, all further evaluations and comparisons rely on the exact cross-entropy loss (10).

Predictive evaluations for identifying near-optimal subsets
The BBA subset search filters the 2 p possible subsets to the m k best subsets for each size k = 1, . . ., s max according to posterior predictive expected loss using the weighted squared error loss for prediction (Section 2.2) or the cross-entropy loss for classification (Section 2.3).However, as noted below Lemma 3, further comparisons based on these expected losses are trivial: the full set S = {1, . . ., p} always obtains the minimum, which precludes variable selection.Selection of a single "best" subset of each size k invites additional difficulties: if multiple subsets perform similarly-which is common for correlated covariates-then selecting m k = 1 subset will not be robust or stable against perturbations of the data.
Equally important, restricting to m k = 1 subset is blind to competing subsets that offer similar predictive performance yet may differ substantially in the composition of covariates; see Sections 3-4.These challenges persist for both classical and Bayesian approaches.
We instead curate and explore a collection of near-optimal subsets.The notion of "nearoptimal" derives from the acceptable family of Kowal (2021).Informally, the acceptable family uses out-of-sample predictive metrics to gather those predictors that match or nearly match the performance of the best out-of-sample predictor with nonnegligible posterior predictive probability under M. The out-of-sample evaluation uses a modified K-fold crossvalidation procedure.Let I k ⊂ {1, . . ., n} denote the kth validation set, where each data point appears in (at least) one validation set, ∪ K k=1 I k = {1, . . ., n}.By default, we use K = 10 validation sets that are equally-sized, mutually exclusive, and selected randomly from {1, . . ., n}.Define an evaluative loss function L(y, x δS ) for the optimal linear coefficients of subset S, and let S denote the collection of subsets obtained from the BBA filtering process.Typically, the evaluative loss L is identical to the loss L used for optimization, but this restriction is not required.For each data split k and each subset S ∈ S, the out-of-sample empirical and predictive losses are respectively, where δ−I k is the posterior predictive distribution at x i conditional only on the training data.Averaging across all data splits, we obtain The distinction between the empirical loss L out S and the predictive loss L out S is important.The empirical loss is a point estimate of the risk under predictions from δS based on the data y.By comparison, the predictive loss provides a distribution of the out-of-sample loss based on the model M.Both are valuable: L out S is entirely empirical and captures the classical notion of K-fold cross-validation, while L out S leverages the Bayesian model to propagate the uncertainty from the model-based data-generating process.
For any two subsets S 1 and S 2 , consider the percent increase in out-of-sample predictive loss from δS 1 to δS 2 : Since D out S 1 ,S 2 inherits a predictive distribution under M, we can leverage the accompanying uncertainty quantification to determine whether the predictive performances of δS 1 and δS 2 are sufficiently distinguishable.In particular, we are interested in comparisons to the best subset for out-of-sample prediction, so that δS min is the optimal linear action associated with the subset S min that minimizes the empirical K-fold cross-validated loss.Unlike the RSS-based in-sample evaluations from Section 2.2, the subset S min can-and usually will-differ from the full set {1, . . ., p}, which enables variable selection driven by out-of-sample predictive performance.
Using S min as an anchor, the acceptable family broadens to include near-optimal subsets: With ( 19), we collect all subsets S that perform within a margin η ≥ 0% of the best subset, D out S min ,S < η, with probability at least ε ∈ [0, 1].Equivalently, a subset S is acceptable if and only if there exists a lower (1 − ε) posterior prediction interval for D out S min ,S that includes η.Hence, unacceptable subsets are those for which there is insufficient predictive probability (under M) that the out-of-sample accuracy of S is within a predetermined margin of the best subset.The acceptable family is nonempty, since S min ∈ A η,ε , and is expanded by increasing η or decreasing ε.By default, we select η = 0% and ε = 0.10 and assess robustness in the simulation study (see also Kowal, 2021;Kowal et al., 2020).
Within the acceptable family, we isolate two subsets of particular interest: the best subset S min from ( 18) and the smallest acceptable subset, When S small is nonunique, so that multiple subsets of the same minimal size are acceptable, we select from those subsets the one with the smallest empirical loss L out S .By definition, S small is the smallest set of covariates that satisfies the near-optimality condition in (19).As noted by Hastie et al. (2009) and others, selection based on minimum cross-validated error, such as S min , often produces models or subsets that are more complex than needed for adequate predictive accuracy.The acceptable family A η,ε , and in particular S small , exploits this observation to provide alternative-and often much smaller-subsets of variables.
To ease the computational burden, we adapt the importance sampling algorithm from Kowal (2021) to compute ( 19) and all constituent quantities (see Appendix A).Crucially, this algorithm is based entirely on the in-sample posterior distribution under model M, which avoids both (i) the intensive process of re-fitting M for each data split k and (ii) data reuse that adversely affects downstream uncertainty quantification.Briefly, the algorithm uses the complete data posterior p M (θ|y) as a proposal for the training data posterior p M (θ|y −I k ).The importance weights are then computed using the likelihood p M (y I k |θ) under model M. Similar algorithms have been deployed for Bayesian model selection (Gelfand et al., 1992), evaluating prediction distributions (Vehtari and Ojanen, 2012), and 1 -based Bayesian variable selection (Kowal et al., 2020).
In the case of new (out-of-sample) covariate values {x i } ñ i=1 = {x i } n i=1 , the predictive loss may be defined without cross-validation: , where ỹi := ỹ(x i ) is the predictive variable at xi for each i = 1, . . ., ñ.The empirical loss is undefined without a corresponding observation of the response variable for each xi .Hence, we modify the acceptable family (19) to instead reference the full subset S = {1, . . ., p} in place of S min , which is no longer defined.When M is a linear model, Corollary 2 implies that the corresponding reference is simply the posterior predictive expectation under M.

Co-variable importance
Although it is common to focus on a single subset for selection, the acceptable family A η,ε provides a broad assortment of competing explanations: linear actions with distinct sets of covariates that all provide near-optimal predictive accuracy.Hence, we seek to further summarize A η,ε beyond S min and S small to identify (i) which covariates appear in any acceptable subset, (ii) which covariates appear in all acceptable subsets, and (iii) which covariates appear together in the same acceptable subsets.
For covariates j and , the sample proportion of joint inclusion in A η,ε achieves each of these goals: and measures (co-) variable importance.Naturally, ( 21) is generalizable to more than two covariates, but is particularly interesting for a single covariate: VI incl (j) := VI incl (j, j) is the proportion of acceptable subsets to which covariate j belongs.When VI incl (j) > 0, covariate j belongs to at least one acceptable subset.Such a result does not imply that covariate j is necessary for accurate prediction, but rather that covariate j is a component of at least one near-optimal linear subset.When VI incl (j) = 1, we refer to covariate j as a keystone covariate: it belongs to all acceptable subsets and therefore is deemed essential.For j = , VI incl (j, ) highlights not only the covariates that co-occur, but also the covariates that rarely appear together.It is particularly informative to identify covariates j and such that VI incl (j) and VI incl ( ) are both large yet VI incl (j, ) is small.In that case, covariates j and are both important yet also redundant, as might be expected for highly correlated variables.
Although VI incl is defined based on linear predictors for each subset S, the variable importance metric applies broadly to (possibly nonlinear) Bayesian models M and a variety of evaluative loss functions L, and can be targeted locally via the weights ω(x i ).The inclusion-based metric VI incl can be extended to incorporate effect size, which is a more common strategy for variable importance.Related, Dong and Rudin (2019) aggregated model-specific variable importances across many "good models".Alternative approaches use leave-one-covariate-out predictive metrics (e.g., Lei et al., 2018), but are less appealing in the presence of correlated covariates.

Posterior predictive uncertainty quantification
A persistent challenge in classical subset selection is the lack of accompanying uncertainty quantification.Given a subset Ŝ selected using the data (X, y), familiar frequentist and Bayesian inferential procedures applied to (X Ŝ , y) are in general no longer valid.In particular, we cannot simply proceed as if only the selected covariates Ŝ were supplied from the onset.Such analyses are subject to selection bias (Miller, 1984).
A crucial feature of our subset filtering (Section 2.2 and Section 2.3) and predictive evaluation (Section 2.4) techniques is that, despite the broad searching and the out-ofsample targets, these quantities all remain in-sample posterior functionals under M.There is no data re-use or model re-fitting: every requisite term is a functional of the complete data posterior p M (θ|y) or p M (ỹ|y) from a single Bayesian model.Hence, the posterior distribution under M remains a valid facilitator of uncertainty quantification.
We elicit a posterior predictive distribution for the action by removing the expectation in (2): which no longer integrates over p M (ỹ|y) and hence propagates the posterior predictive uncertainty.For the squared error loss (3), the predictive action is akin to (5), where ỹ = (ỹ 1 , . . ., ỹñ ) ∼ p M (ỹ|y).The linear coefficients δS inherit a posterior predictive distribution from ỹ and can be computed for any subset S. Similar computations are available for the cross-entropy loss (10).In both cases, draws {ỹ s } S s=1 ∼ p M (ỹ|y) from the posterior predictive distribution are sufficient for uncertainty quantification of δS .Under the usual assumption that p M (ỹ|θ, y) = p M (ỹ|θ), these draws are easily obtained by repeatedly sampling θ s ∼ p M (θ|y) from the posterior and ỹs ∼ p M (ỹ|θ = θ s ) from the likelihood.
The predictive actions are computable for any subset S, include those selected based on the predictive evaluations in Section 2.4.Let Ŝ denote a subset identified based on the posterior (predictive) distribution under M, such as S min or S small .The predictive action δ Ŝ using this subset in ( 23) is a posterior predictive functional.Unlike for a generic subset S, the predictive action δ Ŝ is a functional of both ỹ and Ŝ, which factors into the interpretation.
A similar projection-based approach is developed by Woody et al. (2020) for Gaussian regression.In place of L(ỹ, δ) in ( 22), Woody et al. (2020) suggest E ỹ|θ L(ỹ, δ), which is a middle ground between (2) and ( 22): it integrates over the uncertainty of ỹ given model parameters θ, but preserves uncertainty due to θ.For example, under the linear regression model E ỹ|θ {ỹ(x)} = x β, the analogous result in ( 23) is ( X S Ω XS ) −1 X S Ω Xβ; when S = {1, . . ., p}, this simplifies to the regression coefficient β.Both approaches have their merits, but we prefer ( 22) because of the connection to the observable random variables ỹ rather than the model-specific parameters θ.

Simulation study
We conduct an extensive simulation study to evaluate competing subset selection techniques for prediction accuracy, uncertainty quantification, and variable selection.Prediction for real-valued data is discussed here; classification for binary data is presented in Appendix B. Although we do evaluate individual subsets from the proposed framework-namely, S min and S small -we are more broadly interested in the performance of the acceptable family of subsets.In particular, the acceptable family is designed to collect many subsets that offer near-optimal prediction; both cardinality and aggregate predictive accuracy are critical.
First, we evaluate point prediction accuracy across competing collections of near-optimal subsets.Each collection is built from the same Gaussian linear regression model with horseshoe priors (Carvalho et al., 2010) M and estimated using bayeslm (Hahn et al., 2019).The collections are generated from different candidate sets S based on distinct search methods: the proposed BBA method (bbound(bayes)), the adaptive lasso search (lasso(bayes)) proposed by Hahn and Carvalho (2015) and Kowal et al. (2020), and both forward (forward(bayes)) and backward (backward(bayes)) search.For each collection of candidate subsets, we compute the acceptable families for η = 0%, ε = 0.1, m k = 15, and the observed covariate values {x i } ñ i=1 = {x i } n i=1 ; variations of these specifications are presented in Appendix C.These methods differ only in the search process: if a particular subset S is identified by more than one of these competing methods, the corresponding point predictions-based on the optimal action δS in (5)-will be identical.
The competing collections of near-optimal subsets are evaluated in aggregate.At each simulation, we compute the qth quantile of the root mean squared errors (RMSEs) for the true mean {y * i } n i=1 across all subsets within that acceptable family, and then average that quantity across all simulations.For example, q = 1 describes the predictive performance if we were to use the worst acceptable subset at each simulation, as determined by an oracle.These quantities summarize the distribution of RMSEs within each acceptable family; we report the values for q ∈ {0, 0.25, 0.5, 0.75, 1} and present the results as boxplots in Figure 1.Most notably, the proposed bbound(bayes) strategy produces up to 10 times the number of acceptable subsets as the other methods, yet crucially maintains equivalent predictive performance.Naturally, this expanded collection of subsets also produces a minimum RMSE (q = 0) that outperforms the remaining methods.Clearly, the proposed approach is far superior at collecting more subsets that provide near-optimal predictive accuracy.
Figure 1 also summarizes the point prediction accuracy for several competing methods.First, we include the usual point estimate under M given by the posterior expectation of the regression coefficients β (post.mean),which does not include any variable or subset selection.Next, we compute point predictions for the key acceptable subsets S min and S small using the proposed bbound(bayes) approach; the competing search strategies produced similar results and are omitted.Among frequentist estimators, we use the adaptive lasso (lasso; Zou, 2006) with λ chosen by 10-fold cross-validation and the one-standard-error rule (Hastie et al., 2009).In addition, we include classical subset selection (subset) implemented using the leaps package in R with the final subset determined by AIC.When p > 35, we screen to the first 35 predictors that enter the model in the aforementioned adaptive lasso.Forward (forward) and backward (backward) selection were also considered, but were either noncompetitive or performed similarly to the adaptive lasso and are omitted from this plot.The boxplots summarize the RMSE quantiles for the subsets within each acceptable family, while the vertical lines denote RMSEs of competing methods.The average size of each acceptable family is annotated.The proposed BBA search returns vastly more subsets that remain highly competitive, while S small performs very well and substantially outperforms classical subset selection.
Most notably, the predictive performance of S small is excellent, especially for high SNRs, and substantially outperforms the frequentist methods in all settings.Because S min is overly conservative-i.e., it selects too many variables (see Table 1)-it performs nearly identically to post.mean.Although S min and post.meanoutperform S small when the SNR is low and the sample size is small, note that S small is not targeted exclusively for optimal predictive performance; rather, it represents the smallest subset that obtains near-optimal performance.Lastly, the regularization induced by M is greatly beneficial: every member of each acceptable family decisively outperforms classical subset selection across all scenarios.
Next, we compare uncertainty quantification for the regression coefficients β * .We compute 90% intervals using the predictive action δS from (23) for S small and S min based on the proposed bbound(bayes) procedure.In addition, we compute 90% highest posterior density (HPD) credible intervals for β under M (post.HPD) and 90% frequentist confidence intervals using Zhao et al. (2017), which computes confidence intervals from a linear regression model that includes only the variables selected by the (adaptive) lasso (lasso+lm).The 90% interval estimates are evaluated and compared in Figure 2 using interval widths and empirical coverage.The intervals from S small using ( 22) are uniformly better (i.e., narrower) than the usual posterior HPD intervals under M.In addition, the intervals from Zhao et al. (2017) are highly competitive, despite ignoring selection bias.In all cases, the intervals are overly conservative and achieve the nominal empirical coverage.Lastly, marginal variable selection is evaluated using true positive rates and true negative rates in Table 1.In addition to S min and S small , we include the common Bayesian selection technique that includes a variable j when the 95% HPD interval for β j excludes zero (posterior HPD).The selection performance of S small is excellent and similar to the the adaptive lasso.Classical subset, forward, and backward selection are too aggressive, while the posterior interval-based selection is too conservative.Hence, despite the primary focus on the collection of near-optimal subsets, the smallest acceptable subset S small is a key member with excellent prediction, uncertainty quantification, and marginal selection performance.
The appendix contains simulation studies for classification (Appendix B) and variations for ε ∈ {0.01, 0.1, 0.2}, {x i } ñ i=1 , and the distributions of {x i } n i=1 and {x i } ñ i=1 (Appendix C).All results are qualitatively similar to those presented here.

Subset selection for predicting educational outcomes
Childhood educational outcomes are affected by adverse environmental exposures, such as poor air quality and lead, as well as social stressors, such as poverty, racial residential isolation, and neighborhood deviation.We study childhood educational development using end-of-grade reading test scores for a large cohort of fourth grade children in North Carolina (Children's Environmental Health Initiative, 2020).The reading scores are accompanied by a collection of student-level information detailed in Figure 3, which includes air quality exposures, birth information, blood lead measurements, and social and economic factors (see also Kowal et al., 2020).The goal is to identify subsets of these variables and interactions that offer near-optimal prediction of reading scores as well as accurate classification of at-risk students.A prominent feature of the data is the correlation among the covariates.After centering and scaling the continuous covariates to mean zero and standard deviation 0.5, we augment the variables in Figure 3 (excluding the test scores) with interactions between race and the social and economic factors.The resulting dimensions are n = 16806 and p = 32.Figure E.1 displays the pairwise correlations among the covariates and the response variable.There are strong associations among the air quality exposures as well as among race and the social and economic factors.Due to the dependences among variables, it is likely that distinct subsets of similar predictive ability can be obtained by interchanging among these correlated covariates.Hence, it is advantageous to collect and study the near-optimal subsets.We compute acceptable families for (i) the reading scores y i under squared error loss and (ii) the indicator h(ỹ i ) = I{ỹ i ≥ τ 0.1 } under cross-entropy loss, where τ 0.1 is the 0.1-quantile of the reading scores (see Appendix D).While task (i) broadly considers the spectrum of educational outcomes via reading scores, task (ii) targets at-risk students in the bottom 10% of reading ability.Acceptable families and accompanying quantities for both tasks can be computed using the same Bayesian model M: we focus on Gaussian linear regression with horseshoe priors.The acceptable families are computed using the proposed BBA search with η = 0%, ε = 0.1, m k = 100 and {x i } ñ i=1 = {x i } n i=1 ; results for other η values, ε = 0.2, and m k = 15 are noted, while alternative target covariates {x i } ñ i=1 are in Section 4.2.

Subset selection for predicting reading scores
First, we predict reading scores using a linear model for M and squared error loss for L.
Since acceptable family is defined based on D out S min ,S in ( 19), we summarize its distribution in Figure 4.For each S ∈ S, we display 80% intervals, expectations, and the empirical analog D out S,S min := 100 × (L out S − L out S min )/L out S min .The smaller subsets of sizes four to six demonstrate clear separation for certain subsets.Along with the intercept and the race indicators, these subsets include mEdu (college diploma), EconDisadv, and mEdu (completed high school) in sequence.However, larger subsets are needed to procure near-optimal predictions for smaller margins, such as η < 2%.While the best subset |S min | = 29 includes nearly all of the covariates, many subsets with |S| > 10 achieve within η = 1% of the accuracy of S min .S,Smin (xmarks) under squared error loss for each subset size |S| with S ∈ S. We annotate S min (dashed gray line) and S small (solid gray line) for ε = 0.10 and η = 0 and jitter the subset sizes for clarity of presentation.
Among the |S| = 2761 candidate subsets identified from the BBA search, there are |A 0,0.1 | = 1183 acceptable subsets.We summarize A η,ε via the co-variable importance metrics VI incl (j) and VI incl (j, ) in Figures 5 and E.2, respectively.Unlike many variable importance metrics that measure effect sizes, VI incl (j) instead quantifies whether each covariate j is a component of all, some, or no competitive subsets.There are many keystone covariates that appear in (nearly) all acceptable subsets, including environmental exposures (prenatal air quality and blood lead levels), economic and social factors (EconDisadv, mother's education level, neighborhood deprivation at time of test), and demographic information (race, gender), among others.(nearly) all, many (> 70%), some (> 30%), or no acceptable subsets.
Interestingly, chronic and acute PM 2.5 exposure each belong to nearly 50% of acceptable subsets (Figure 5), yet rarely appear in the same acceptable subset (Figure E.2).The pairwise correlations (Figure E.1) offer a reasonable explanation: these variables are weakly correlated with reading scores but highly correlated with one another.Similar results persist for neighborhood deprivation and racial residential isolation both at birth and time of test.Moreover, this analysis was conducted after removing one acute PM 2.5 outlier (50% larger than all other values).When that outlier is kept in the data, acute PM 2.5 no longer belongs to any acceptable subset, while VI incl (j) for chronic PM 2.5 increases.These results are encouraging: the acceptable family identifies redundant yet distinct predictive explanations, but prefers the more stable covariate in the presence of outliers.
Next, we analyze the smallest acceptable subset S small and incorporate uncertainty quantification for the accompanying linear coefficients.The |S small | = 19 selected covariates are displayed in Figure 6 alongside the point and 90% intervals based on the proposed approach, the Bayesian model M, and the adaptive lasso.The estimates and intervals for covariates excluded from S small are identically zero for the proposed approach (and, in this case, the adaptive lasso as well), while the estimates and HPD intervals from M are dense for all covariates.Despite the encouraging simulation results for the Zhao et al. (2017) frequentist intervals, these intervals often exclude the adaptive lasso point estimates, which undermines interpretability.
The estimates from S small and M are similar with anticipated directionality: higher mother's education levels, lower blood lead levels in the child, less neighborhood deprivation, and absence of economic disadvantages predict higher reading scores.Prenatal air quality exposure (PM 2.5 ) is significant: due to seasonal effects, the 1st and 3rd trimester exposures have negative coefficients, while the 2nd trimester has a positive effect.Naturally, these effects can only be interpreted jointly.Among interaction terms, the negative effect of NH Black × RI birth suggests the lower reading scores among non-Hispanic Black students are accentuated by racial residential isolation in the neighborhood of the child's birth.Since we do not force all main effects into each subset, S small does not contain an estimated effect of RI birth for other race groups.Other interactions, such as the positive effects of Hisp and NH Black by EconDisadv and Hisp × Blood lead, must also be interpreted carefully: the vast majority of Hispanic and non-Hispanic Black students belong to the EconDisadv group and have much higher blood lead levels on average, while each of NH Black, EconDisadv, and Blood lead has a strong negative main effect.The results are not particularly sensitive to m k or ε.When m k = 15, there are |S| = 436 candidate subsets and |A 0,0.1 | = 197 acceptable subsets.The variable importance metrics broadly agree with Figures 5 and E.2, while S small -and therefore Figure 6-is unchanged.When ε = 0.2 (and m k = 100 as before), the acceptable family reduces slightly to |A 0,0.1 | = 977 members and S small differs only in the addition of Smoker and Hisp × NDI test.

Out-of-sample prediction
We evaluate the predictive capabilities of the proposed approach for 20 training/testing splits of the NC education data.The same competing methods are adopted from Section 3, including the distinct search strategies for collecting near-optimal subsets.Since S small and S min are reasonably robust to m k , we select m k = 15 for computational efficiency.In addition, we include the acceptable family defined by setting {x i } ñ i=1 to be the testing data covariate values (bbound(Xtilde)), which is otherwise identical to bbound(bayes).Root mean squared prediction errors (RMSPEs) are used for evaluation.
The results are presented in Figure 7.Among single subset methods, S small outperforms all competitors-including the classical forward and backward estimators and the smallest acceptable subsets from lasso(bayes), forward(bayes), and backward(bayes) discussed in Section 3 (not shown).The adaptive lasso selects fewer variables and is not competitive.Among search methods, Figure 7 confirms the results from the simulation study: the proposed BBA strategy (bbound(bayes)) identifies 10-25 times the number of subsets as the other search strategies, yet does not sacrifice any predictive accuracy in this expanded collection.Clearly, bbound(bayes) provides a more complete predictive picture, as the competing search strategies omit a massive number of subsets that do offer near-optimal prediction.The acceptable family based on the out-of-sample covariates bbound(Xtilde) is much larger, which is reasonable: the subsets are computed and evaluated on covariates for which the accompanying response variables are not available.This collection of subsets sacrifices some predictive accuracy relative to the other search strategies, yet still outperforms the adaptive lasso-yet another testament to the importance of the regularization induced by M. The boxplots summarize the RMSPE quantiles for the subsets within each acceptable family, while the vertical lines denote RMSPEs of competing methods.The average size of each acceptable family is annotated.The proposed BBA search returns vastly more subsets that remain highly competitive, while the accompanying S small subset performs best overall.
Although the RMSPE differences appear to be small, minor improvements are practically relevant: even a single point on a standardized test score can be the difference between progression to the next grade level (on the low end) or eligibility for intellectually gifted programs (on the high end).More generally, education data are prone to weak signals and small effect sizes, which are conditions under which many methods may offer similar predictive performance.Nonetheless, our primary goal is not to substantially improve prediction, but rather to identify and analyze a large collection of near-optimal subsets.
The importance of the broader BBA search strategy is further highlighted in comparison lasso(bayes), which uses a lasso search path for decision analytic Bayesian variable selection (Hahn and Carvalho, 2015;Kowal et al., 2020).In particular, lasso(bayes) generates only |S| = 25 candidate subsets and |A 0,0.1 | = 9 acceptable subsets.By comparison, bbound(bayes) returns more than 100 times the number of candidate subsets and acceptable subsets.Figure 7 shows that the subsets omitted by lasso(bayes) yet discovered by bbound(bayes) are indeed near-optimal.Further, the restrictive search path of lasso(bayes) does not guarantee greater stability: the smallest acceptable subsets for bbound(bayes) and lasso(bayes) are nearly identical, yet the interquartile range of |S small | across the 20 training/testing splits under bbound(bayes) is 0, while the same quantity under lasso(bayes) is 2. Indeed, by this metric, bbound(bayes) actually provides the most stable smallest acceptable subset across all search strategies considered.

Discussion
We developed decision analysis tools for Bayesian subset search, selection, and (co-) variable importance.The proposed strategy is outlined in Algorithm 1. Building from a Bayesian predictive model M, we derived optimal linear actions for any subset of covariates.We explored the space of subsets using an adaptation of the branch-and-bound algorithm.After filtering to a manageable collection of promising subsets, we identified the acceptable family of near-optimal subsets for linear prediction or classification.The acceptable family was summarized by a new (co-) variable importance metric-the frequency with which variables (co-) appear in all, some, or no acceptable subsets-and individual member subsets, including the "best" and smallest subsets.Using the posterior predictive distribution from M, we developed point and interval estimates for the linear coefficients of any subset.Simulation studies demonstrated better prediction, interval estimation, and variable selection for the proposed approach compared to existing Bayesian and frequentist selection methods-even for high-dimensional data with p > n.
We applied these tools to a large education dataset to study the factors that predict educational outcomes.The analysis identified several keystone covariates that appeared in (almost) every near-optimal subset, including environmental exposures, economic and social factors, and demographic information.The co-variable importance metrics highlighted an interesting phenomenon where certain pairs of covariates belonged to many acceptable subsets, yet rarely appeared in the same acceptable subset.Hence, these variables are effectively interchangeable for prediction, which provides valuable context for interpreting their respective effects.We showed that the smallest acceptable subset offers excellent prediction of end-of-grade reading scores and classification of at-risk students using substantially fewer covariates.The corresponding linear coefficients described new and important effects, for example that greater racial residential isolation among non-Hispanic Black students is predictive of lower reading scores.However, our results also caution against overreliance on any particular subset: we identified over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.4. Collect the acceptable family of subsets that offer near-optimal prediction or classification; 5. Summarize the acceptable family via (co-) variable importance, the best subset, and the smallest subset.
6. Obtain posterior predictive samples of the coefficients δS for any subset S of interest.
Future work will attempt to generalize these tools via the loss functions and the actions.Alternatives to squared error and cross-entropy loss can be incorporated with an IRLS approximation strategy similar to Section 2.3, which would maintain the methodology and algorithmic infrastructure from the proposed approach.Similarly, the class of parametrized actions can be expanded to include nonlinear predictors, such as trees or additive models, with acceptable families constructed in the same way.
where often E ỹ|θ {ỹ(x i )} has an explicit form (such as in regression models).Absent such simplifications, the expectations can be computed by averaging the draws of ỹ−I k i ∼ p M (ỹ i |y −I k ).Although these terms can be computed by repeatedly re-fitting the Bayesian model M for each training/validation split k = 1, . . ., K, such an approach is computationally demanding.Instead, we apply a sampling-importance resampling (SIR) algorithm to approximate these out-of-sample quantities.Notably, the SIR algorithm requires only a single fit of model M to the complete data-which is already needed for posterior inference-and therefore contributes minimally to the computational cost of the aggregate analysis.
The details are provided in Algorithm 2. By construction, the samples {ỹ s i } S s=s 1 are from the out-of-sample predictive distribution p M (ỹ i |y −I k ).Based on Algorithm 2, it is straightforward to compute draws of D out S 1 ,S 2 for any S 1 , S 2 ∈ S, as well as the best subset δS min in (18).Therefore, the acceptable family A η,ε is readily computable for any η, ε.By default, we use S = S/2 SIR samples.

Appendix B. Simulation study for classification
The synthetic data-generating process for classification mimics that for prediction: the only difference is that the data are generated as For the Bayesian model M, we use a logistic regression model with horseshoe priors and estimated using rstanarm (Goodrich et al., 2018).The competing estimators are constructed similarly as before, now using cross-entropy loss (10) for the proposed approach and the logistic likelihood for the adaptive lasso.
The classification performance is evaluated using cross-entropy loss for π * i in Figure B.1 (top), using the same competing search methods as in Figure 1.As in the regression case, the proposed bbound(bayes) search procedure returns vastly more subsets in the acceptable family, yet maintains excellent classification performance within this collection.In addition, classification based on S min and S small substantially outperform the adaptive lasso.We also include the 90% interval comparisons in

Appendix C. Simulation study for sensitivity analysis
We revisit the comparisons in Figure 1 to consider sensitivities to ε and {x i } ñ i=1 .In Fig- ure C.1, we report results for the acceptable families with ε = 0.01 and ε = 0.2; the frequentist estimators and post.meanare unchanged from Figure 1.Compared also to the analogous case in Figure 1 with ε = 0.1, we see the expected ordering in the cardinalities of the acceptable families: larger values of ε produce a more stringent criterion and therefore fewer acceptable subsets.Notably, the main results are not sensitive to the choice of ε ∈ {0.01, 0.1, 0.2}, and S small performs exceptionally well in all cases.
Next, we again modify the comparisons in Figure 1 to evaluate prediction at a new collection of covariates {x i } ñ i=1 .We allow variations in the covariate data-generating processeither correlated standard normals from Section 3 or iid Uniform(0,1)-and whether the observed covariates {x i } n i=1 follow the same data-generating process as the target covariates {x i } ñ i=1 .These configurations generate four scenarios, all with {x i } ñ i=1 = {x i } n i=1 .The competing acceptable families are each generated using this choice of {x i } ñ i=1 and the point predictions are evaluated for y * (x i ) = x i β * for i = 1, . . ., ñ = 1000.The results are in Figure C.2. Naturally, performance worsens across the board when observed and target covariates differ in either their values ({x i } n i=1 = {x i } ñ i=1 ) or their distributions.When the target covariates are iid uniform (top right, bottom left), all methods actually perform better-likely due to the light-tailed (bounded) and independent covariate values.Yet perhaps most remarkably, the relative performance among the methods remains consistent, while S small outperforms all competitors in all but one scenario-and decisively outperforms the frequentist methods in all cases.S,Smin (xmarks) under cross-entropy for each subset size |S| with S ∈ S. We annotate S min (dashed gray line) and S small (solid gray line) and jitter the subset sizes for clarity of presentation.
In aggregate, these results suggest that near-optimal linear classification of at-risk students is achievable for a broader variety of smaller subsets of covariates.Lastly, the selected covariates from S small are plotted with 90% intervals in Figure D.3.Compared to the prediction version, S small for classification replaces 1st trimester PM 2.5 exposure with acute PM 2.5 exposure and omits the interactions Hisp × Blood lead and Hisp × EconDisadv, but is otherwise the same.The posterior expectations under M are excluded because they are not targeted for prediction of h(ỹ i ) and therefore are not directly comparable.However, there is some disagreement between S small and the adaptive lasso, the latter of which excludes several keystone covariates and again suffers from inconsistency between the point and interval estimates.

Figure 2 :
Figure 2: Mean interval widths (boxplots) with empirical coverage (annotations) for β * .Nonoverlapping notches indicate significant differences between medians.The proposed intervals based on S small are significantly narrower than the usual HPD intervals under M yet maintain the empirical nominal 90% coverage.
Air quality during gestation, life of the child, and at time of testing PM25_T1 PM2.5 24 hour -Average of gestation weeks 1-13 (trimester 1) --ugm3 PM25_T2 PM2.5 24 hour -Average of gestation weeks 14-26 (trimester 2) --ugm3 PM25_T3 PM2.5 24 hour -Average of gestation weeks 27-delivery (trimester 3) --ugm3 PM25_chronic PM2.5 Mean of 1 year prior to June 1 of the year of the EOG test PM25_acute PM2.5 Mean of 30 days prior to May 1 of the year of the EOG test Birth information BWTpct Birthweight Percentile (based on clinical estimate of gestation) mEdu Mother's Education Group at time of birth (NoHS = No high school diploma, HS = High school diploma, College = College diploma) mRace Mother's Race/Ethnicity Group (White = White, NH Black = Non-Hispanic Black, Hispanic = Hispanic)

Figure 4 :
Figure 4: The 80% intervals (bars) and expected values (circles) for D out Smin,S with D outS,Smin (xmarks) under squared error loss for each subset size |S| with S ∈ S. We annotate S min (dashed gray line) and S small (solid gray line) for ε = 0.10 and η = 0 and jitter the subset sizes for clarity of presentation.

Figure 5 :
Figure 5: Variable importance VI incl (j) for prediction.There are several tiers: variables appear in

Figure 6 :
Figure6: Estimated linear effects and 90% intervals for the variables in S small based on the proposed approach, model M, and the adaptive lasso.

Figure 7 :
Figure 7: Root mean squared prediction errors (RMSPEs) across 20 training/testing splits of the NC education data.The boxplots summarize the RMSPE quantiles for the subsets within each acceptable family, while the vertical lines denote RMSPEs of competing methods.The average size of each acceptable family is annotated.The proposed BBA search returns vastly more subsets that remain highly competitive, while the accompanying S small subset performs best overall.

Algorithm 1 :
Bayesian subset selection for interpretable prediction and classification 1. Fit a Bayesian predictive model M; 2. Specify a loss function for prediction or classification, including design points (covariate values) {x i } ñ i=1 and local weights {ω(x i )} ñ i=1 ; 3. Filter to a family of candidate subsets of covariates: (a) Screen to the best s max ≤ p covariates based on M; (b) Apply the branch-and-bound algorithm to select the best m k ≤ p k subsets of each size k = 1, . . ., p; Hence, it is required to compute (i) the optimal action on the training data, δ−I k S := arg min δ S E ỹ|y −I k L({ỹ i } i∈I k , δ S ), and (ii) samples from the out-of-sample predictive distribution ỹ−I k i ∼ p M (ỹ i |y −I k ).Because of the simplifications afforded by (4)-(5) and (11), computing δ−I k S only requires the out-of-sample expectations ŷ−I k i := E ỹ|y −I k {ỹ(x i )} or ĥ−I k i := E ỹ|y −I k h(ỹ i ) for the classification setting.For many models M, there is a further simplification that E ỹ|y

Algorithm 2 :
Out-of-sample predictive evaluations.1. Obtain posterior samples {θ s } S s=1 ∼ p M (θ|y); 2. For each training set k = 1, . . ., K: (a) Compute log w s k c = − log p M (y I k |θ s ) = − i∈I k log p M (y i |θ s ) (up to a constant); (b) Sample {s 1 , . . ., s S } without replacement from {1, . . ., S} with probability weights {w 1 k , . . ., w S k }; (c) Sample ỹs i ∼ p M (ỹ i |θ s) for s = s1 , . . ., s S and i = 1, . . ., n; j ∈ I k ; (e) Compute δ−I k S for each S ∈ S by solving (5) using the training data covariates {x i } i ∈I k , the weights {ω(x i )} i ∈I k , and the pseudo-data {ŷ −I k j } j ∈I k ; (f) Compute L out S (k) and { L out,s , . . ., S. Algorithm 2 recycles computations: steps 1-2(d) are shared among all subsets S and loss functions L. Hence, the algorithm is efficient even when the number of candidate subsets |S| is large-and notably does not require re-fitting the Bayesian model M. Modifications for classification (Section 2.3) are straightforward: steps (c), (d), and (e) are replaced by (c ) ỹs i ∼ p M (ỹ i |θ s) for s = s1 , . . ., s S and i = 1, . . ., n; (d ) ĥ−I k j ≈ S−1 S s=s 1 h(ỹ s j ) for j ∈ I k ; and (e ) compute δ−I k S by solving (11) using the training data covariates {x i } i ∈I k , the weights {ω(x i )} i ∈I k , and the pseudo-data { ĥ−I k j } j ∈I k .
Figure B.1 (bottom), which confirm the results from the prediction scenario: S small and Zhao et al. (2017) (modified for the logistic case) achieve the nominal coverage with the narrowest intervals.

Figure B. 1 :
Figure B.1: Top: cross-entropy loss for π * i .The boxplots summarize the cross-entropy quantiles for the subsets within each acceptable family, while the vertical lines denote cross-entropy of competing methods.The average size of each acceptable family is annotated.Bottom: Mean interval widths (boxplots) with empirical coverage (annotations) for β * .

Figure
Figure D.1: The 80% intervals (bars) and expected values (circles) for D out Smin,S with D outS,Smin (xmarks) under cross-entropy for each subset size |S| with S ∈ S. We annotate S min (dashed gray line) and S small (solid gray line) and jitter the subset sizes for clarity of presentation.
Figure D.3: Estimated linear effects and 90% intervals for the variables in S small based on the proposed approach and the adaptive lasso.

Figure E. 1 :
Figure E.1: Pairwise correlations among the covariates in the NC education data.

Figure E. 2 :
Figure E.2:Co-variable importance VI incl (j, ) for prediction.For certain pairs of variables (chronic and acute PM 2.5 exposure; neighborhood deprivation and racial residential isolation both at birth and time of test), it is common for one-but not both-to appear in an acceptable subset.

Table 1 :
True positive rates (TPR) and true negative rates (TNR) for Gaussian synthetic data with p * + 1 = 6 active covariates including the intercept.The selection performance of S small is excellent and similar to the the adaptive lasso.Classical subset, forward, and backward selection are too aggressive, while the posterior interval-based selection is too conservative.
Figure 3: Variables in the NC education dataset.Data are restricted to individuals with 37-42 weeks gestation, mother's age 15-44, Blood lead ≤ 80, birth order ≤ 4, no current limited English proficiency, and residence in NC at time of birth and time of 4th grade test.
Education / EOG test informationEOG_Reading Reading scale score for chronologically first EOG test taken (4 th grade)